Text Bundling: Statistics Based Data-Reduction

Authors

  • Lawrence Shih
  • Jason D. M. Rennie
  • Yu-Han Chang
  • David R. Karger
Abstract

As text corpora become larger, tradeoffs between speed and accuracy become critical: slow but accurate methods may not complete in a practical amount of time. In order to make the training data a manageable size, a data reduction technique may be necessary. Subsampling, for example, speeds up a classifier by randomly removing training points. In this paper, we describe an alternate method for reducing the number of training points by combining training points such that important statistical information is retained. Our algorithm keeps the same statistics that fast, linear-time text algorithms like Rocchio and Naive Bayes use. We provide empirical results that show our data reduction technique compares favorably to three other data reduction techniques on four standard text corpora.
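
One way to read "combining training points such that important statistical information is retained" is sketched below in Python. This is an illustrative assumption, not the authors' algorithm, which the abstract does not spell out. The sketch groups same-class documents at random, replaces each group with its average term vector, and records how many originals each bundle stands for. The names `bundle` and `class_mean` and the toy data are hypothetical. The per-class mean, the statistic a Rocchio-style centroid classifier relies on, is unchanged when the bundles are weighted by their sizes.

```python
import random
from collections import defaultdict


def bundle(points, labels, bundle_size, seed=0):
    """Combine same-class points into bundles by averaging.

    Returns (bundled_points, bundled_labels, bundle_weights), where each
    weight is the number of original points merged into that bundle.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(points, labels):
        by_class[y].append(x)

    out_points, out_labels, out_weights = [], [], []
    for y, xs in by_class.items():
        xs = xs[:]       # copy so the caller's data is not shuffled
        rng.shuffle(xs)  # assumption: random grouping within a class
        for i in range(0, len(xs), bundle_size):
            group = xs[i:i + bundle_size]
            dim = len(group[0])
            mean = [sum(v[d] for v in group) / len(group) for d in range(dim)]
            out_points.append(mean)
            out_labels.append(y)
            out_weights.append(len(group))
    return out_points, out_labels, out_weights


def class_mean(points, labels, target, weights=None):
    """Weighted mean of the points labeled `target`."""
    if weights is None:
        weights = [1] * len(points)
    rows = [(x, w) for x, y, w in zip(points, labels, weights) if y == target]
    total = sum(w for _, w in rows)
    dim = len(rows[0][0])
    return [sum(x[d] * w for x, w in rows) / total for d in range(dim)]


# Toy term-count vectors (hypothetical data, three terms per document).
docs = [[2, 0, 1], [4, 0, 0], [0, 3, 1], [1, 1, 1],
        [0, 5, 0], [3, 0, 2], [0, 2, 2]]
labels = ["a", "a", "b", "a", "b", "a", "b"]

b_docs, b_labels, b_weights = bundle(docs, labels, bundle_size=2)
```

Here seven training points shrink to four bundles, while `class_mean(b_docs, b_labels, c, b_weights)` matches `class_mean(docs, labels, c)` for each class `c` up to floating-point rounding. Subsampling, by contrast, discards points outright, so the class means of the reduced set generally drift from the originals.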

Publication date: 2003